
Libraries Used

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ROCR)
library(ggplot2)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(readr)
library(xgboost)
## 
## Attaching package: 'xgboost'
## 
## The following object is masked from 'package:dplyr':
## 
##     slice
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
session=list()
for(i in 1:18){
  file_path <- paste('/Users/akshajjoshi/AJROOT/UCDAVIS/Code/STA141AProject/Data/sessions/session', i, '.rds', sep='')
  session[[i]]=readRDS(file_path)
}

Abstract

In this study, we look deeply into how the brain's activity is linked to behavior, such as actions, choices, and how engaged an animal is. This research provides important new information by mapping out where in the brain these activities happen. Using advanced Neuropixels probes, the original experiments recorded signals from about 30,000 neurons in 42 different areas of the mouse brain. This work helps us understand how brain functions like seeing, making choices, moving, and paying attention arise from different parts of the brain.

This research builds on past studies that mostly examined single brain areas at a time, missing how different parts of the brain work together during decision-making. Using these new tools, we can look at how different brain areas behave during tasks in which mice had to discriminate between visual signals. Steinmetz et al. found that while some brain activity is common across many areas when a mouse starts to move, the tasks of seeing a stimulus and deciding on an action are handled by specific areas. This challenges older ideas and shows how complex and distributed brain activity is. These discoveries open new paths for understanding how the brain works, especially how behaviors are controlled and decisions are made.


Section 1 Introduction

In this study, we delve into the intricate relationship between neural activity and behavioral responses using data from experiments conducted on mice. The primary focus is on the analysis of neural spike trains and their correlation with the outcomes of visual stimuli-based decision-making tasks. Utilizing a subset of data from the comprehensive study by Steinmetz et al. (2019), we concentrate on spike trains recorded from the visual cortex of four mice across eighteen experimental sessions. This research integrates advanced computational techniques, including machine learning algorithms such as XGBoost, logistic regression, and random forests, to develop predictive models that assess the likelihood of success or failure in the trials based on neural activity patterns and stimulus contrasts.

Our exploratory data analysis sheds light on the structural and functional nuances of the dataset, including the distribution of neural spikes, variability across trials, and differences among individual mice. The study not only aims to predict trial outcomes effectively but also explores the homogeneity and heterogeneity of neural responses across sessions and subjects. This approach both enhances our understanding of the neural underpinnings of decision-making and contributes to the broader field of computational neuroscience by applying data integration techniques and predictive modeling to elucidate complex brain-behavior relationships.

Section 2 Exploratory analysis

Dataframe to be used for EDA

Here is a glimpse into the data we will be working with for this project. Note all the variables and categories involved.

head(full_tibble)
## # A tibble: 6 × 13
##   brain_area region_sum_spike region_count region_mean_spike trail_id
##   <chr>                 <dbl>        <int>             <dbl>    <int>
## 1 ACA                     138          109             1.27         1
## 2 CA3                     104           68             1.53         1
## 3 DG                       79           34             2.32         1
## 4 LS                      269          139             1.94         1
## 5 MOs                     106          113             0.938        1
## 6 SUB                     156           75             2.08         1
## # ℹ 8 more variables: contrast_left <dbl>, contrast_right <dbl>,
## #   feedback_type <dbl>, mouse_name <chr>, date_exp <chr>, session_id <int>,
## #   success <dbl>, contrast_diff <dbl>
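The chunk that builds `full_tibble` is not echoed above. As a hedged sketch (the helper name and exact aggregation are my assumptions, inferred from the columns shown in the glimpse), the per-trial, per-brain-area summaries could be computed in base R like this:

```r
# Hypothetical helper: summarise one trial's spikes by brain area.
# Assumes each session list has $spks (a list of neuron x time-bin matrices)
# and $brain_area (one label per neuron), as in the Steinmetz data.
get_trial_area_summary <- function(s, trail_id) {
  spike_sum <- rowSums(s$spks[[trail_id]])   # total spikes per neuron
  sums   <- tapply(spike_sum, s$brain_area, sum)
  counts <- tapply(spike_sum, s$brain_area, length)
  means  <- tapply(spike_sum, s$brain_area, mean)
  data.frame(brain_area        = names(sums),
             region_sum_spike  = as.numeric(sums),
             region_count      = as.numeric(counts),
             region_mean_spike = as.numeric(means),
             trail_id          = trail_id,
             row.names         = NULL)
}
```

Row-binding these summaries across trials and sessions, then attaching the stimulus and feedback fields, would yield a table shaped like `full_tibble` above.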

Summary of key features by Session

(session_summary)
## # A tibble: 18 × 7
##    session_id mouse_name num_trials mean_contrast_diff mean_success_rate
##         <int> <chr>           <int>              <dbl>             <dbl>
##  1          1 Cori              114              0.471             0.605
##  2          2 Cori              251              0.458             0.633
##  3          3 Cori              228              0.398             0.662
##  4          4 Forssmann         249              0.355             0.667
##  5          5 Forssmann         254              0.372             0.661
##  6          6 Forssmann         290              0.363             0.741
##  7          7 Forssmann         252              0.321             0.671
##  8          8 Hench             250              0.545             0.644
##  9          9 Hench             372              0.489             0.685
## 10         10 Hench             447              0.448             0.620
## 11         11 Hench             342              0.433             0.795
## 12         12 Lederberg         340              0.393             0.738
## 13         13 Lederberg         300              0.419             0.797
## 14         14 Lederberg         268              0.396             0.694
## 15         15 Lederberg         404              0.430             0.765
## 16         16 Lederberg         280              0.430             0.718
## 17         17 Lederberg         224              0.464             0.830
## 18         18 Lederberg         216              0.417             0.806
## # ℹ 2 more variables: unique_area <int>, mean_session_spike <dbl>

This data frame categorizes the data from `full_tibble` by session so we can get deeper insights into the data.

Summary of Stimuli and Feedback and Total number of Neurons

Here are some quick summaries of the count of each feedback type (coded 1 and -1), the mean left and right contrasts, and the total number of neurons present.

cat("There are", total_neurons, "neurons.")
## There are 62 neurons.
(feedback_summary)
## # A tibble: 2 × 2
##   feedback_type count
##           <dbl> <int>
## 1            -1 14533
## 2             1 34640
(stimuli_summary)
## # A tibble: 1 × 2
##   mean_contrast_left mean_contrast_right
##                <dbl>               <dbl>
## 1              0.347               0.321

Looking at success rate vs contrast difference

Here we explore the relationship between mean success rate and contrast difference, and we can see a fairly positive relationship between the two. The relationship is not strictly linear as the plot might suggest, but a linear smooth shows the positive trend most clearly.

(success_vs_contrast)
## `geom_smooth()` using formula = 'y ~ x'

Exploring Neural Activity

Through these plots, we look at the neural activity data in a variety of ways. A few observations: neural activity generally decreases or remains about the same as sessions increase; most of the activity occurs where the mean spike rate is low; and of the four mice, Lederberg has the most success in general.

(neural_activity_by_trial)
## `geom_smooth()` using formula = 'y ~ x'

(distribution_neural)

(contrast_diff)
## `geom_smooth()` using formula = 'y ~ x'

(scatter_plot_faceted)

(neural_activity_success)

(count_by_brain_area)

Looking at change over trials

Here we look at success rate over time, and we see a pattern of the rate decreasing over many trials for all four mice, most notably for Forssmann. This trend of success rate decreasing or remaining stagnant over trials appears across all 18 sessions.

(success_trials)
## `geom_smooth()` using formula = 'y ~ x'

(success_mouse)
## `geom_smooth()` using formula = 'y ~ x'

Homogeneity and Heterogeneity Across Sessions and Mice

From these graphs we can see that the mice Hench and Lederberg have the highest numbers of distinct brain areas recorded. The success rate graphs also support the observation that Lederberg has the most success. Additionally, we see that the most region spikes occur in the earlier trials for all four mice.

(region_mean_spike_plot)

(homogeneity_heterogeneity_plot)

(success_rate_session_plot)

(mean_spike_session_plot)

(distinct_areas_plot)

(success_rate_time_plot)

Dimension Reduction through PCA

From the PCA scree plot we can see that most of the variance in the data is captured by the first two PCs, since that is where the curve drops the most. This tells us that additional components add little, and we should look primarily at these two dimensions.
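The PCA chunk itself is not echoed; a minimal base-R sketch of the scree computation described above, using simulated stand-in features (the real input would be the bin-averaged spike columns), is:

```r
# PCA via prcomp; the scree values are the proportion of variance
# captured by each principal component.
set.seed(141)
features <- matrix(rnorm(200 * 6), ncol = 6)   # stand-in for the spike features
pca <- prcomp(features, center = TRUE, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
plot(var_explained, type = "b",
     xlab = "Principal component",
     ylab = "Proportion of variance explained")
```

The components come out ordered by variance, so a sharp early drop in this curve is what justifies keeping only the first two PCs.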

Section 3 Data integration

For data integration, each row represents a trial, identified by trial id and session id. Each row also includes the left contrast, right contrast, contrast difference, and the average spike rate in each time bin. This data frame will serve as the input on which my predictive models are built.

## # A tibble: 6 × 45
##   session_id trail_id contrast_right contrast_left contrast_diff   bin1   bin2
##   <fct>         <int>          <dbl>         <dbl>         <dbl>  <dbl>  <dbl>
## 1 1                 1            0.5           0             0.5 0.0490 0.0368
## 2 1                 2            0             0             0   0.0300 0.0313
## 3 1                 3            1             0.5           0.5 0.0490 0.0504
## 4 1                 4            0             0             0   0.0559 0.0531
## 5 1                 5            0             0             0   0.0272 0.0436
## 6 1                 6            0             0             0   0.0490 0.0218
## # ℹ 38 more variables: bin3 <dbl>, bin4 <dbl>, bin5 <dbl>, bin6 <dbl>,
## #   bin7 <dbl>, bin8 <dbl>, bin9 <dbl>, bin10 <dbl>, bin11 <dbl>, bin12 <dbl>,
## #   bin13 <dbl>, bin14 <dbl>, bin15 <dbl>, bin16 <dbl>, bin17 <dbl>,
## #   bin18 <dbl>, bin19 <dbl>, bin20 <dbl>, bin21 <dbl>, bin22 <dbl>,
## #   bin23 <dbl>, bin24 <dbl>, bin25 <dbl>, bin26 <dbl>, bin27 <dbl>,
## #   bin28 <dbl>, bin29 <dbl>, bin30 <dbl>, bin31 <dbl>, bin32 <dbl>,
## #   bin33 <dbl>, bin34 <dbl>, bin35 <dbl>, bin36 <dbl>, bin37 <dbl>, …
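The integration chunk is hidden; assuming each `bin` column is the mean spike count across neurons within one of 40 time bins (consistent with the test-set code in Section 5), one trial's row could be built like this, with the helper name being my own:

```r
bin_names <- paste0("bin", 1:40)

# Hypothetical helper: average a trial's neuron x time-bin spike matrix
# over neurons to get one mean spike rate per time bin.
get_trial_bins <- function(s, trail_id) {
  spikes <- s$spks[[trail_id]]                      # neurons x 40 bins
  bins <- as.data.frame(matrix(colMeans(spikes), nrow = 1))
  names(bins) <- bin_names
  bins$trail_id       <- trail_id
  bins$contrast_left  <- s$contrast_left[trail_id]
  bins$contrast_right <- s$contrast_right[trail_id]
  bins$contrast_diff  <- abs(bins$contrast_left - bins$contrast_right)
  bins
}
```

Stacking these rows across trials and sessions gives the 45-column tibble shown above once the session id is appended.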

Section 4 Predictive modeling

Let's look into using an XGBoost model.

## [1]  train-logloss:0.601631 
## [2]  train-logloss:0.550638 
## [3]  train-logloss:0.510053 
## [4]  train-logloss:0.481637 
## [5]  train-logloss:0.460702 
## [6]  train-logloss:0.441478 
## [7]  train-logloss:0.426525 
## [8]  train-logloss:0.414870 
## [9]  train-logloss:0.408038 
## [10] train-logloss:0.392890
## Accuracy:  0.7312992
##           Reference
## Prediction   0   1
##          0  76  52
##          1 221 667
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = test_label, predictor = predictions)
## 
## Data: predictions in 297 controls (test_label 0) < 719 cases (test_label 1).
## Area under the curve: 0.7379

From the output above, the XGBoost model achieves an accuracy of 0.731 and an AUC of 0.738.
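The modeling chunk is not echoed; a hedged sketch of an XGBoost fit consistent with the 10-round log-loss trace above (assuming the classic `xgboost()` interface, with simulated data standing in for the integrated trial features) is:

```r
library(xgboost)

# Simulated stand-in for the integrated features; the real input would be
# the bin columns plus contrasts, with 0/1 success as the label.
set.seed(141)
n <- 400
X <- matrix(rnorm(n * 5), ncol = 5)
y <- as.numeric(X[, 1] + rnorm(n, sd = 0.8) > 0)
train_idx <- 1:300

dtrain <- xgb.DMatrix(data = X[train_idx, ], label = y[train_idx])
xgb_model <- xgboost(data = dtrain,
                     objective = "binary:logistic",
                     eval_metric = "logloss",
                     nrounds = 10, verbose = 0)   # 10 rounds, as in the log above

predictions <- predict(xgb_model, X[-train_idx, ])
predicted_labels <- as.numeric(predictions > 0.5)
accuracy <- mean(predicted_labels == y[-train_idx])
```

Thresholding the predicted probabilities at 0.5 yields the confusion matrix, and feeding the raw probabilities to `pROC::roc()` yields the AUC reported above.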

Next, we will try a logistic regression model

## Model accuracy: 0.7214567

## F1 Score: 0.2977667
## Recall: 0.2020202
## Precision Score: 0.5660377
## Misclassification Rate: 0.2785433
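The logistic-regression chunk is not echoed either; a self-contained base-R sketch of the fit and of how F1, recall, precision, and misclassification rate derive from the confusion matrix (simulated data stands in for the trial features) is:

```r
# Logistic regression via glm; train_df / test_df are simulated stand-ins
# for the integrated trial features with a 0/1 success label.
set.seed(42)
n <- 300
x <- rnorm(n)
success <- as.numeric(x + rnorm(n, sd = 0.8) > 0)
dat <- data.frame(x = x, success = success)
train_df <- dat[1:200, ]
test_df  <- dat[201:n, ]

logit_model <- glm(success ~ ., data = train_df, family = binomial)
probs <- predict(logit_model, newdata = test_df, type = "response")
preds <- as.numeric(probs > 0.5)

# Confusion-matrix counts and derived metrics.
tp <- sum(preds == 1 & test_df$success == 1)
fp <- sum(preds == 1 & test_df$success == 0)
fn <- sum(preds == 0 & test_df$success == 1)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
accuracy  <- mean(preds == test_df$success)
misclass  <- 1 - accuracy
```

The same metric formulas applied to the report's confusion matrix reproduce the precision, recall, and F1 figures listed above.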

Here we see logistic regression gives us an accuracy of 0.72 and a misclassification rate of 0.278.

Using a Random Forest Model

## Model accuracy: 0.7332677
## F1 Score: 0.1958457
## Recall: 0.1111111
## Precision: 0.825
## Misclassification Rate: 0.2667323

Finally, we see the random forest model gives us an accuracy of 0.733 and a misclassification rate of 0.267.


Section 5 Prediction performance on the test sets

test_data=list()
for(i in 1:2){
  file_path <- paste('/Users/akshajjoshi/Downloads/test/test', i, '.rds', sep='')
  test_data[[i]] <- readRDS(file_path)
}

binename <- paste0("bin", as.character(1:40))

get_trail_functional_data <- function(session_id, trail_id){
  spikes <- test_data[[session_id]]$spks[[trail_id]]
  if (any(is.na(spikes))){
    message("value missing")  # flag trials with missing spike data
  }

  trail_bin_average <- matrix(colMeans(spikes), nrow = 1)
  colnames(trail_bin_average) <- binename
  trail_tibble <- as_tibble(trail_bin_average) %>%
    add_column("trail_id" = trail_id) %>%
    add_column("contrast_left" = test_data[[session_id]]$contrast_left[trail_id]) %>%
    add_column("contrast_right" = test_data[[session_id]]$contrast_right[trail_id]) %>%
    add_column("feedback_type" = test_data[[session_id]]$feedback_type[trail_id])

  trail_tibble
}
get_session_functional_data <- function(session_id){
  n_trail <- length(test_data[[session_id]]$spks)
  trail_list <- list()
  for (trail_id in 1:n_trail){
    trail_list[[trail_id]] <- get_trail_functional_data(session_id, trail_id)
  }
  session_tibble <- as_tibble(do.call(rbind, trail_list))
  session_tibble <- session_tibble %>%
    add_column("mouse_name" = test_data[[session_id]]$mouse_name) %>%
    add_column("date_exp" = test_data[[session_id]]$date_exp) %>%
    add_column("session_id" = session_id)
  session_tibble
}

session_list <- lapply(seq_along(test_data), get_session_functional_data)
test_full_functional_tibble <- as_tibble(do.call(rbind, session_list))
test_full_functional_tibble$session_id <- as.factor(test_full_functional_tibble$session_id)
test_full_functional_tibble$contrast_diff <- abs(test_full_functional_tibble$contrast_left - test_full_functional_tibble$contrast_right)

test_full_functional_tibble$success <- as.numeric(test_full_functional_tibble$feedback_type == 1)

Since XGBoost has the highest model accuracy, I will use it to model the test data.

## Warning in predicted_labels == testing_y$success: longer object length is not a
## multiple of shorter object length
## [1] "Accuracy: 0.665026569572919"

With our test data, we got an accuracy of 0.665. The length warning above indicates the predicted labels and test labels differ in length, so this figure should be treated with caution.


Section 6 Discussion

Through our initial data analysis, we looked at different trends in spike rate over sessions, trials, and other parameters. From there we set out to build a predictive modeling methodology.

Overall, all three models tested gave solid accuracy scores, and they were all very close to each other, hovering around the low 70s. With further modeling, a deep learning neural network or a series of other models could offer more insight and a better way of predicting success rates. Another important consideration going forward is feature engineering, so we can decide in advance which features are best suited for modeling. The confusion matrices also showed that these models have a lot of room for improvement, as there were still high numbers of incorrect classifications.

Acknowledgment: ChatGPT was used to help understand the assignment and code.